Risk Factors and Predictive Modeling for Post-Acute Sequelae of SARS-CoV-2 Infection: Findings from EHR Cohorts of the RECOVER Initiative

Background Patients infected with SARS-CoV-2 may develop newly incident conditions in the post-acute infection period. These conditions, denoted as the post-acute sequelae of SARS-CoV-2 infection (PASC), are highly heterogeneous and involve a diverse set of organ systems. Few studies have investigated the predictability of these conditions and their associated risk factors.
Method In this retrospective cohort study, we investigated two large-scale PCORnet clinical research networks, INSIGHT and OneFlorida+, including 11 million patients in the New York City area and 16.8 million patients from Florida, to develop machine learning prediction models for patients at risk of newly incident PASC and to identify factors associated with newly incident PASC conditions. Adult patients aged 20 years or older with SARS-CoV-2 infection and without recorded infection between March 1st, 2020, and November 30th, 2021, were used to identify factors associated with incident PASC after removing background associations. The predictive models were developed on infected adults.
Results We found that several incident PASC conditions, e.g., malnutrition, COPD, dementia, and acute kidney failure, were associated with severe acute SARS-CoV-2 infection, defined by hospitalization and ICU stay. Older age and extremes of weight were also associated with these incident conditions. These conditions were better predicted (C-index > 0.8). Moderately predictable conditions included diabetes and thromboembolic disease (C-index 0.7–0.8). These were associated with a wider variety of baseline conditions. Less predictable conditions included fatigue, anxiety, sleep disorders, and depression (C-index around 0.6).
Conclusions This observational study suggests that a set of likely risk factors for different PASC conditions were identifiable from EHRs, predictability of different PASC conditions was heterogeneous, and using machine learning-based predictive models might help in identifying patients who were at risk of developing incident PASC.


Introduction
The global COVID-19 pandemic starting in late 2019 has led to more than 557 million infections and 6.4 million deaths as of July 14, 2022. 1 Growing scientific and clinical evidence has demonstrated the existence of potential post-acute and long-term effects of COVID-19, which affect multiple organ systems 2 and are referred to as post-acute sequelae of SARS-CoV-2 infection (PASC). Recently there have been several retrospective cohort analyses identifying potential PASC using real-world patient data 3,4,5 . However, research on the predictability of PASC and their associated risk factors is still limited, and mixed results have been reported. Such predictive modeling research can help patients and healthcare professionals to recognize the risk of PASC early and inform effective actions. Several studies found that older age, higher severity in the acute phase of SARS-CoV-2 infection 6 , and pre-existing conditions (e.g., hypertension, obesity) may be associated with a higher risk of developing PASC. [7][8][9][10][11][12] By contrast, some studies also reported that baseline clinical characteristics or demographics were not associated with PASC. 10 Two main challenges may explain these seemingly conflicting findings: 1) Prior studies have typically been conducted using patient cohorts with small sample sizes including only a few hundred 13 or thousand 8 patients, limiting the significance and generalizability of the conclusions derived; and 2) PASC conditions are highly heterogeneous [14][15][16] , thus their predictabilities and associated risk factors could be heterogeneous as well.
To fill this knowledge gap and address these challenges, we conducted a systematic study on the predictability of a broad spectrum of incident PASC conditions and their associated risk factors. We used two large electronic health records (EHR) cohorts from the PCORnet clinical research networks (CRN) 17 , namely INSIGHT 18 , covering patients in the New York City (NYC) area, and OneFlorida+ 19 , including patients from Florida. INSIGHT was used for the primary analyses and OneFlorida+ for validation. A wide range of PASC conditions were selected based on our previous findings using a rigorous data-driven analysis pipeline 14 and other existing evidence or clinical knowledge (see the Methods section for a detailed list of PASC diagnoses). We developed machine learning-based prediction models to identify patients who were more likely to develop particular incident PASC conditions from their baseline characteristics and acute severity according to medical utilization. We compared the performance of machine learning models with different levels of complexity, including a regularized Cox proportional hazard model, regularized logistic regression, a gradient boosting machine, and a deep neural network, in both the survival analysis setting and the binary classification setting, and examined the potential risk factors for different PASC conditions after removing background associations. We observed that within the broad range of PASC, a pattern of conditions associated with severe acute disease was reliably predictable. Decreasing the burden of severe disease will likely improve these outcomes. However, a variety of PASC conditions were less predictable and were less associated with upfront disease severity. This lack of predictability may represent a challenge as the burden of severe disease decreases.
This study is part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, which seeks to understand, treat, and prevent the post-acute sequelae of SARS-CoV-2 infection (PASC).

Data
This study used two large-scale de-identified real-world EHR datasets from the INSIGHT Clinical Research Network (CRN) 18 and the OneFlorida+ CRN 19 .

Definition of Post-acute Sequelae of SARS-CoV-2 (PASC)
We examined a list of potential PASC conditions as outcomes, including depressive disorders, anxiety disorder, general PASC symptoms and signs with ICD codes U099/B948, fever, malaise and fatigue, dizziness, malnutrition, fluid disorders, diabetes mellitus, edema, pressure ulcers, hair loss, paresthesia, dermatitis, chronic obstructive pulmonary disease (COPD), atelectasis, pulmonary fibrosis, dyspnea, acute pharyngitis, acute bronchitis, dementia, myopathies, cerebral ischemia, encephalopathy, cognitive problems, sleep disorders, headache, muscle weakness, fibromyalgia, joint pain, acute kidney failure, cystitis, genitourinary problems, constipation, gastroparesis, abdominal pain, gastroesophageal reflux disease (GERD), heart failure, hypotension, pulmonary embolism, thromboembolism, abnormal heartbeat, chest pain, and anemia. We compiled this list based on both our previous study and evidence from other literature. 3,4,14 A condition was defined as incident in a SARS-CoV-2 infected patient if it was recorded from 31 days to 180 days after the acute infection but not from three years to seven days before it. See Supplementary Table 1 for the detailed code list.
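The incident-condition rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_incident` is hypothetical, "three years" is approximated as 1,095 days, and window boundaries are treated as inclusive, none of which is specified in the text.

```python
from datetime import date, timedelta


def is_incident(index_date, diagnosis_dates):
    """Return True if a condition counts as newly incident in the post-acute window.

    Assumptions (not from the paper): 3 years = 1095 days, inclusive boundaries.
    """
    baseline_start = index_date - timedelta(days=3 * 365)
    baseline_end = index_date - timedelta(days=7)
    post_start = index_date + timedelta(days=31)
    post_end = index_date + timedelta(days=180)

    # Any record in the baseline window means the condition is pre-existing.
    in_baseline = any(baseline_start <= d <= baseline_end for d in diagnosis_dates)
    # The condition must be recorded at least once in the post-acute window.
    in_post_acute = any(post_start <= d <= post_end for d in diagnosis_dates)
    return in_post_acute and not in_baseline
```

For example, with an index date of January 1, 2021, a first diagnosis on March 1, 2021 would count as incident, whereas the same diagnosis preceded by a record in June 2020 would not.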

Eligibility criteria and study population
We included adult patients aged 20 years or older with at least one SARS-CoV-2 polymerase chain reaction (PCR) or antigen laboratory test from March 1st, 2020, to November 30th, 2021. We further required at least one diagnosis code within three years to seven days before the index date (referred to as the baseline period), and at least one diagnosis code from 31 days to 180 days after the index date (referred to as the post-acute phase or follow-up period), to ensure that patients were connected to the healthcare system and were being observed during the study period. We followed each patient from 31 days after their index date until the day of the first target outcome, documented death, the latest date of any documented records in the database, 180 days after the baseline, or the end of our observational window (December 31, 2021), whichever came first. The two exposure groups were the SARS-CoV-2 infected group and the non-infected group. The SARS-CoV-2 infected group included patients with a positive SARS-CoV-2 PCR or antigen laboratory test. The index date of the infected group was defined as the date of the first documented positive PCR or antigen test. The non-infected group included patients whose SARS-CoV-2 PCR or antigen tests were all negative throughout the entire study period with no documented COVID-19-related diagnoses at any time. The index date for patients in the non-infected group was defined as the date of the first negative PCR or antigen test. The association study and predictive modeling were conducted on the infected group. The non-infected group was used to rule out background associations that were not specific to PASC. The patient inclusion and exclusion cascades are illustrated in Fig. 1.
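As a rough illustration of the follow-up rule, the sketch below computes a patient's follow-up time and event indicator. The function name `follow_up` and its arguments are hypothetical, and "180 days after the baseline" is interpreted here as 180 days after the index date, which is an assumption.

```python
from datetime import date, timedelta

STUDY_END = date(2021, 12, 31)  # end of the observational window


def follow_up(index_date, outcome_date=None, death_date=None, last_record_date=None):
    """Return (days of follow-up, event indicator) for one patient.

    Follow-up starts 31 days after the index date and stops at the earliest of:
    first target outcome, death, last record, day 180, or the study end.
    """
    start = index_date + timedelta(days=31)
    candidates = [index_date + timedelta(days=180), STUDY_END]
    for d in (death_date, last_record_date):
        if d is not None:
            candidates.append(d)
    censor = min(candidates)

    # The outcome counts only if it falls inside the observed follow-up window.
    if outcome_date is not None and start <= outcome_date <= censor:
        return (outcome_date - start).days, 1
    return max((censor - start).days, 0), 0
```

A patient indexed on November 1, 2021 with no outcome would therefore be censored at the end of the observational window rather than at day 180.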

Covariates
We collected clinical features in the baseline period (3 years to 1 week before lab-confirmed SARS-CoV-2 infection) and the severity of acute infection (1 week before to 30 days after lab-confirmed SARS-CoV-2 infection). Age was categorized into 20-39 years, 40-54 years, 55-64 years, 65-74 years, and 75 years and older groups. We set 55-64 as the reference group. Gender was grouped into female and male (reference). Only three patients in INSIGHT were identified as other/missing gender, and they were excluded. Race was categorized into Asian, Black or African American, White (reference), and other or missing.
Ethnicity was grouped into Hispanic, not Hispanic (reference), and other/missing. We used the national-level area deprivation index (ADI) to capture the socioeconomic disadvantage of patients' residential neighborhood. 20 Larger ADI values indicate more socioeconomically deprived status. Missing ADI values were imputed with the median ADI per site. The ADI is a ranking from 1 to 100, with 1 and 100 representing the lowest and the highest level of disadvantage, respectively. We grouped ADI into five categories and set the ADI category 1-20 as the reference group. Baseline healthcare utilization up to three years before the index date was measured according to care setting. For inpatient, outpatient, and emergency department encounters, we categorized each setting into 0 visits (reference group), 1 or 2 visits, and 3 or more visits. We also considered the infection periods, which were grouped into March 2020 - June 2020, July 2020 - October 2020, November 2020 - February 2021, March 2021 - June 2021, and July 2021 - November 2021. We set the first wave of the pandemic from March 2020 to June 2020 in INSIGHT as the reference group. Of note, the third wave, from July 2021 to November 2021, was dominated by the Delta variant. Body mass index (BMI) was grouped according to the WHO classification, with BMI < 18.5 as underweight, BMI 18.5-24.9 as normal weight (reference), BMI 25-29.9 as overweight, and BMI ≥ 30 as obese; missing values were set as a separate group. Smoking status was also considered and categorized into never (reference), current, former, and missing. Smoking status was missing for a substantial share of patients (90.2% in COVID-positive patients and 49.8% in patients with at least one PASC), and we grouped these patients into the missing category.
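The categorical encodings above can be illustrated with a small Python sketch. The function names are hypothetical. Two details are assumptions rather than facts from the text: the five ADI categories are taken to be equal-width bins of 20 (only the 1-20 reference bin is stated), and BMI values between the listed ranges (e.g., 24.95) are assigned by strict upper cut-offs at 25 and 30.

```python
def age_group(age):
    """Map age in years to the study's age categories (reference: 55-64)."""
    if age < 20:
        raise ValueError("study population is restricted to adults aged 20 or older")
    if age <= 39:
        return "20-39"
    if age <= 54:
        return "40-54"
    if age <= 64:
        return "55-64"
    if age <= 74:
        return "65-74"
    return "75+"


def bmi_group(bmi):
    """Map BMI to WHO categories; None becomes its own 'missing' group."""
    if bmi is None:
        return "missing"
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"  # reference group
    if bmi < 30:
        return "overweight"
    return "obese"


def adi_group(adi):
    """Map an integer ADI rank (1-100) to a bin; equal-width bins are assumed."""
    if not 1 <= adi <= 100:
        raise ValueError("ADI is a ranking from 1 to 100")
    low = ((adi - 1) // 20) * 20 + 1
    return f"{low}-{low + 19}"
```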
A wide range of baseline clinical comorbidities were collected, based on a revised list of the Elixhauser comorbidities, conditions recommended by our clinician group, and related medications. Patients were ascertained as having a condition if they had at least two corresponding diagnoses documented during the baseline period. We also counted the number of pre-existing conditions and grouped them into no comorbidity (reference), 1, 2, 3, 4, and 5 or more. A detailed list of these pre-existing conditions and the defined reference categories is summarized in Extended Data Table 1.
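A minimal sketch of the two-diagnosis ascertainment rule and the comorbidity-count grouping follows. The function names and the input format (one condition label per documented baseline diagnosis) are hypothetical choices made for illustration.

```python
from collections import Counter


def baseline_comorbidities(diagnoses, min_count=2):
    """Return the conditions documented at least `min_count` times at baseline.

    `diagnoses` holds one condition label per documented baseline diagnosis.
    """
    counts = Counter(diagnoses)
    return {condition for condition, n in counts.items() if n >= min_count}


def comorbidity_count_group(n_conditions):
    """Group the number of pre-existing conditions into 0, 1, 2, 3, 4, or 5+."""
    return "5+" if n_conditions >= 5 else str(n_conditions)
```

Under this rule, a patient with two chronic kidney disease diagnoses and a single cancer diagnosis at baseline would be counted as having one comorbidity.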

Statistical analysis
For each potential PASC condition, we analyzed its association with the covariates in the following steps. (Step I) We built a separate multivariate Cox proportional hazard model on the EHR of patients who were infected by SARS-CoV-2 to assess associations between the covariates and the time to the first incident event or censoring in the follow-up period (31-180 days after COVID-19 confirmation). The censoring event is defined as the earliest of documented death, loss to follow-up in the database, 180 days after the baseline, or the end of our observational window (December 31, 2021). Fully adjusted hazard ratios (aHR) of each covariate and target PASC event were estimated. (Step II) We built another multivariate Cox proportional hazard model on the EHR of all patients regardless of their SARS-CoV-2 infection status. The model inputs include two parts. One is the set of covariates. The other is the set of interaction terms, defined as the product of each covariate and SARS-CoV-2 infection status (1 for SARS-CoV-2 infected patients and 0 for non-infected control patients). In this way, the coefficient of a particular covariate captured its association with the outcome condition for patients who were not infected by SARS-CoV-2, and the coefficient of its corresponding interaction term captured the "quantitative modifications" of such association for patients who were infected by SARS-CoV-2. Fully adjusted hazard ratios of each covariate and interaction term were estimated on both infected and non-infected patients.
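One way to write down the Step II model is sketched below. The notation is ours rather than the paper's: $x_j$ denotes the $j$-th covariate, $z \in \{0, 1\}$ the infection status, $h_0(t)$ the baseline hazard, and $\beta_j$, $\gamma_j$ the covariate and interaction coefficients. Only the covariates and their interaction terms are included, as described in the text.

```latex
h(t \mid x, z) = h_0(t)\,\exp\Big( \sum_{j} \beta_j x_j + \sum_{j} \gamma_j\, x_j z \Big)

% Hazard ratio of covariate j among non-infected patients (z = 0):
\mathrm{HR}_j^{(z=0)} = \exp(\beta_j)

% Hazard ratio of covariate j among infected patients (z = 1):
\mathrm{HR}_j^{(z=1)} = \exp(\beta_j + \gamma_j) = \exp(\beta_j)\,\exp(\gamma_j)
```

Under this formulation, the hazard ratio of the interaction term, $\exp(\gamma_j)$, is the multiplicative change in the covariate's hazard ratio for infected relative to non-infected patients, so $\exp(\gamma_j) > 1$ indicates a stronger association among infected patients.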
We identified a covariate as a potential risk factor of a particular PASC condition if it satisfied the following three criteria: (1) the adjusted hazard ratio (aHR) estimated from the infected patients in Step I was greater than 1; (2) the p-value of this aHR was smaller than 0.000562, a threshold corrected by the Bonferroni method for multiple testing; and (3) the aHR of the interaction term of the corresponding covariate in Step II was also greater than 1.
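The three criteria amount to a simple screening rule, sketched below with hypothetical function names. As an observation rather than a statement from the paper, the reported threshold of 0.000562 is what a Bonferroni correction of a 0.05 significance level over 89 tests would give (0.05 / 89 ≈ 0.000562), and 89 is the number of covariates reported in the Results; treating these as the inputs is an assumption.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def is_potential_risk_factor(ahr_infected, p_value, ahr_interaction, threshold):
    """Apply the three screening criteria to one covariate-condition pair."""
    return (
        ahr_infected > 1           # (1) Step I aHR above 1 in infected patients
        and p_value < threshold    # (2) significant after multiple-testing correction
        and ahr_interaction > 1    # (3) Step II interaction aHR above 1
    )


# Assumed inputs: alpha = 0.05 and 89 tests (not stated explicitly in the text).
THRESHOLD = bonferroni_threshold(0.05, 89)
```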
To build predictive models for each PASC condition, we examined different machine learning models in both survival analysis and binary classification settings. For the survival analysis setting, we used a multivariate Cox proportional hazard model with L2-norm regularization to predict the time to the outcome event. For the binary classification setting, the occurrence of the target event in the follow-up period was labeled as 1, and 0 otherwise. We used logistic regression with L2-norm regularization, a gradient boosting machine with a random forest base learner, and a deep feed-forward neural network. For each of these models, the best model was selected by grid search over a predefined hyperparameter space (see details in the sensitivity analysis paragraph below) through repeated cross-validation (ten times, five folds). The concordance index (C-index) and the area under the receiver operating characteristic curve (AUROC) were used to evaluate survival prediction performance and binary prediction performance, respectively. Both measures range from 0 to 1, with 0.5 indicating random guessing and 1 indicating perfect prediction. The 95% confidence interval of the final performance was estimated by bootstrapping the performance 1,000 times on each of the test sets in the repeated cross-validation.
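For readers unfamiliar with the C-index, the sketch below computes Harrell's concordance index and a percentile bootstrap interval in plain Python. It is an illustrative simplification, not the authors' implementation: tied event times are ignored, the quadratic loop is suited only to small samples, and the function names and the percentile method for the interval are assumptions.

```python
import random


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had the event and did so
    strictly before patient j's event or censoring time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def bootstrap_ci(times, events, risk_scores, n_boot=1000, seed=0):
    """Percentile 95% interval of the C-index from bootstrap resamples."""
    rng = random.Random(seed)
    n = len(times)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(concordance_index(
                [times[k] for k in idx],
                [events[k] for k in idx],
                [risk_scores[k] for k in idx],
            ))
        except ValueError:
            continue  # resample had no comparable pairs
    stats.sort()
    lower = stats[int(0.025 * len(stats))]
    upper = stats[min(int(0.975 * len(stats)), len(stats) - 1)]
    return lower, upper
```

A risk score that ranks every patient in the order of their event times yields a C-index of 1, a fully reversed ranking yields 0, and a constant score yields 0.5.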

Stratified analysis
The stratified analysis was conducted by stratifying patients by their severity in the acute infection phase (hospitalized or non-hospitalized) and then performing the statistical analysis within each stratum. The non-infected control patients were also stratified according to whether they were hospitalized during the period from 1 week before to 30 days after their index date, to capture background associations within each subgroup population.

Sensitivity analysis
To obtain robust conclusions, we conducted the following sensitivity analyses. For the association analysis, we also used a univariate Cox model for each covariate adjusted for age, sex, and acute severity. We further tested the impact of lifting Step II of the statistical analysis on the identified risk associations.
For the predictive modeling, we also tested a different feature engineering method, which used the first 3 digits of ICD-10 codes and medications at the ingredient level, to test to what extent PASC can be predicted in a data-driven manner. We selected 1,593 ICD-10 diagnosis codes and 2,309 drugs from the INSIGHT dataset, and 1,698 ICD-10 diagnosis codes and 4,366 drugs from the OneFlorida+ dataset. These ICD-10 diagnosis codes and medications were selected to construct the input feature vectors of the prediction model based on a significant difference (P-value less than 0.0001) between patients with and without each PASC condition. After the feature selection process, the selected ICD-10 diagnosis codes, medications, and collected baseline covariates were combined into the feature representation used to predict each PASC condition.
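The data-driven feature construction can be illustrated with a small sketch that truncates ICD-10 codes to their first three characters and builds an indicator vector over a selected vocabulary. The function names are hypothetical, and the use of binary presence indicators (rather than counts) is an assumption, since the text does not specify the encoding.

```python
def icd10_category(code):
    """Reduce an ICD-10 code to its first three characters, e.g. 'E11.9' -> 'E11'."""
    return code.replace(".", "").upper()[:3]


def build_feature_vector(patient_codes, vocabulary):
    """Binary indicators: 1 if the patient has any code in a selected category.

    `vocabulary` is the ordered list of selected three-character categories.
    """
    present = {icd10_category(code) for code in patient_codes}
    return [1 if category in present else 0 for category in vocabulary]
```

For example, a patient with codes E11.9 and I10 and a vocabulary of E11, I10, and J44 would be encoded as the vector [1, 1, 0].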
We also tested different machine learning predictive models in both the survival analysis setting and the binary classification setting to assess whether the predictability of each PASC condition was affected by the choice of model. For the survival analysis setting, we tested a Cox proportional hazard model with L2-norm regularization. For the binary classification setting, we investigated three machine learning models of different complexity. The first is regularized logistic regression. We adopted the L2-norm penalty and searched the inverse of regularization strength from 10^-3 to 10^3 with 0.5 as the sampling step size. The second is the gradient boosting machine with a random forest as the base learner. We searched over the maximum depth (3, 4, 5), the maximum number of leaves in one tree (10, 20, 30), and the minimal number of data in one leaf (30). The third is the deep feed-forward neural network. We used the ReLU (Rectified Linear Unit) activation function for the hidden layers, and searched over the hidden layer sizes ((32,), (64,), (128,), (32, 32), (64, 64), (128, 128)) and the learning rate (0.001, 0.01, 0.1). For each of these models, the best model was selected by grid search of the corresponding hyperparameter space through repeated cross-validation (ten times, five folds). In the repeated cross-validation process, we set one of the folds as the test set and the rest of the data as the training set. The C-index and the area under the receiver operating characteristic curve (AUROC) were used to measure the predictive performance in the survival setting and the binary classification setting, respectively.
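The search spaces and the repeated cross-validation scheme can be enumerated with the standard library alone, as in the sketch below. This only illustrates the bookkeeping, not model fitting. The dictionary keys are illustrative labels rather than the parameter names of any particular library, and the 0.5 step for logistic regression is interpreted as a step in the base-10 exponent, which is an assumption.

```python
import itertools
import random


def logistic_c_grid():
    """Inverse regularization strengths 10^-3 ... 10^3 (assumed step of 0.5 in the exponent)."""
    return [10 ** (k / 2) for k in range(-6, 7)]


GBM_GRID = {
    "max_depth": [3, 4, 5],
    "max_leaves": [10, 20, 30],
    "min_data_in_leaf": [30],
}
DNN_GRID = {
    "hidden_layers": [(32,), (64,), (128,), (32, 32), (64, 64), (128, 128)],
    "learning_rate": [0.001, 0.01, 0.1],
}


def expand_grid(grid):
    """List every hyperparameter combination in a grid."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[k] for k in keys))]


def repeated_kfold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (train, test) index lists: each repeat reshuffles, each fold is the test set once."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            train = [i for f, fold in enumerate(folds) if f != k for i in fold]
            yield train, folds[k]
```

With these settings, the gradient boosting grid has 9 combinations, the neural network grid has 18, and the ten-times five-fold scheme produces 50 train/test splits per combination.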

Validation analysis and generalizability
To assess the generalizability of our conclusions, we replicated the above association analyses and predictive analyses on the OneFlorida+ cohort. The cohort selection and modeling strategies were the same as in our primary analyses on the INSIGHT cohort.

Prediction Performance
We developed our primary results on the INSIGHT cohort and used the OneFlorida+ cohort as a validation cohort. Both cohorts were collected from patients who had at least one PCR/antigen test for SARS-CoV-2 infection from March 2020 to November 2021, and the inclusion-exclusion cascade was provided in [...]. The current definition of PASC in the RECOVER protocols is ongoing, relapsing, or new symptoms, or other health effects occurring four or more weeks after the acute phase of SARS-CoV-2 infection. 21 We built a list of 89 covariates that are potentially associated with PASC based on a revised list of Elixhauser comorbidities, recommendations of our RECOVER clinician team, and the severity of acute SARS-CoV-2 infection. These covariates included basic demographics (e.g., age, gender, race, ethnicity), socioeconomic status in terms of the Area Deprivation Index (ADI) 21 , healthcare utilization history, body mass index, the period of infection, comorbidities, and the care settings in the acute phase, including hospitalization and ICU admission. For each of the categorical covariates, we defined its reference group in the same way as prior studies of acute SARS-CoV-2 infection (see the Covariates section of the Methods for details). 6 We built different machine learning models to predict the individual risk of each incident condition using these covariates. The prediction performance of a regularized Cox model, measured by the concordance index (C-index) 23 with a 95% confidence interval, is shown in Fig. 2 (results for other machine learning models are provided in the Sensitivity Analysis section). Figure 2 shows that different incident conditions were associated with heterogeneous predictive performance. Conditions such as dementia, malnutrition, stroke, non-specific PASC (U099/B948), and kidney failure had a C-index > 0.8, as did other conditions such as myopathy and pressure sores.
We noted that diabetes, thromboembolic disease, and COPD were moderately predictable, with a C-index > 0.7, and other conditions such as fatigue, anxiety disorders, and sleep disorders were less predictable, with a C-index < 0.6.
Associations between risk factors and specific PASC conditions. Furthermore, we analyzed the associations between the covariates and the risk of developing any incident condition from our list. We estimated the unadjusted hazard ratio (HR) and the fully adjusted hazard ratio (aHR) for each covariate. A covariate was identified as a potential risk factor for developing a particular condition if it satisfied the following three criteria: (1) the corresponding aHR of the covariate with respect to the target condition was larger than 1 when compared with the reference group (Covariates section of the Methods and Extended Data Table 1); (2) the association was statistically significant after multiple testing correction (p-value < 0.000562); and (3) the associated risk was higher in the SARS-CoV-2 infected patient population than in the non-infected population. Note that criterion (3) is meant to ensure that the risk association we identified is not a common one that widely exists in patients without SARS-CoV-2 infection. Figure 3 depicts the associations between the identified risk factors and specific PASC conditions, which we elaborate on as follows.
The severity of acute infection. Increased severity of the acute SARS-CoV-2 infection (according to the care settings) was associated with a higher risk of being diagnosed with new incident conditions in the post-acute period. Overall, a higher risk of getting any incident diagnosis was observed in patients who were hospitalized during the acute phase (1.29 (1.24-1.33)) or in ICU (1.40 (1.32-1.49)) compared to patients who were not hospitalized during the acute phase (as a reference group, see Extended Data Table 1). Figure 3 further shows the associations between the acute phase severity and a range of potential PASC conditions. Specifically, compared to non-hospitalized patients, the ICU patients showed a 4.7-fold higher risk of being diagnosed with myopathy, a 2.5-fold higher risk of being diagnosed with pressure ulcers, a 2.3-fold higher risk of being diagnosed with thromboembolism, and a 2.1-fold higher risk of being diagnosed with malaise and fatigue. In addition, patients who were hospitalized or admitted to ICU during the acute phase had a higher risk of being diagnosed with general PASC codes U099/B948, with 4.3- and 2.2-fold increases compared to non-hospitalized patients.
Age. Patients aged 75 or older showed an increased risk of being diagnosed with a wide range of potential PASC conditions in the post-acute infection phase, including dementia (5.8-fold higher), COPD [...]
Pre-existing conditions. As shown in Fig. 3, having one or more baseline conditions was associated with a higher risk of potential PASC diagnosis including malnutrition, fluid disorders, anemia, and chest pain.
Specifically, cancer patients showed increased risk in a broad list of post-acute conditions including malnutrition, atelectasis, fever, anemia, pulmonary fibrosis, constipation, and fibromyalgia compared to those without cancer diagnoses at baseline. Patients having baseline chronic kidney disease showed an increased risk of being diagnosed with heart failure and anemia. Those with baseline cirrhosis showed a 3-fold-increased risk of gastroparesis, a 2-fold-increased risk of atelectasis, and a 1.8-fold-increased risk of anemia. Those with baseline coagulopathy showed a higher risk of thromboembolism and cognitive problems. Patients with end-stage renal disease showed a higher risk of COPD and malnutrition. Those with baseline mental health disorders exhibited a higher risk of dementia and anxiety disorders in the post-acute period. Parkinson's disease patients showed a 2.2-fold-increased risk of encephalopathy. Pregnant females showed a 2.4-fold-increased risk of anemia in the post-acute period. Those with baseline pulmonary circulation disorder showed a 3.3-fold-increased risk of pulmonary embolism and a 1.9-fold-increased risk of heart failure. Patients with weight loss at baseline were at a higher risk of being diagnosed with pressure ulcers, COPD, constipation, and general PASC (with U099/B948) in the post-acute phase.

Stratified Risk analysis
We further conducted stratified analyses to examine the associations between baseline factors and incident potential PASC conditions according to the care settings in the acute phase (hospitalized versus non-hospitalized). The same criteria on adjusted hazard ratio and statistical significance as used for Fig. 3 were adopted here to identify potential risk associations, which are shown in Fig. 4.

Overall, certain demographic characteristics including older age, female gender, and Black race, as well as baseline conditions including obesity and chronic kidney disease, were associated with an increased risk of being diagnosed with PASC in both non-hospitalized and hospitalized patients. There were also differences in the identified associations across the two settings. Specifically, for patients who were not hospitalized during acute infection, we observed that baseline arrhythmia was associated with a 1.9-fold-increased risk of an incident diagnosis of heart failure in the post-acute period, pregnancy was associated with a 3.4-fold-increased risk of incident anemia, and patients with baseline sickle cell disease showed a 3.2-fold-increased risk of being diagnosed with anxiety disorder. However, these associations were not identified among patients who were hospitalized in the acute phase. For these patients, we observed that baseline pulmonary circulation disorder was associated with an 11.4-fold-increased risk of being diagnosed with pulmonary embolism, and baseline multiple sclerosis was associated with a 2.7-fold-increased risk of being diagnosed with malaise and fatigue in the post-acute phase.

Sensitivity analysis
We examined the impact of the criterion requiring an identified association to carry a higher risk in SARS-CoV-2 infected patients than in non-infected patients. Extended Data Fig. 1 depicts the identified associations after we lifted this requirement, i.e., when we only required these associations to satisfy the adjusted hazard ratio and statistical significance constraints. From the figure, we observed that more associations were identified compared to Fig. 3, and many of these associations may not be relevant to SARS-CoV-2 infection. Taking patients with pre-existing cancer as an example, they were associated with a higher risk of being diagnosed with fluid disorders, acute kidney failure, thromboembolism, encephalopathy, edema, and malaise and fatigue in the post-acute period after SARS-CoV-2 infection. However, these associations can also be identified for non-infected cancer patients. Therefore, the excess risk criterion is necessary for filtering out the associations that are not specific to SARS-CoV-2 infection.
We also tested to what extent the predictability of incident potential PASC conditions is affected by the choice of machine learning model. We investigated a range of machine learning models with different complexities, including regularized logistic regression models, gradient boosting machines, and feed-forward deep neural networks. As shown in Extended Data Fig. 2, we observed little difference in performance across these models, and the heterogeneous predictability patterns were still observed, i.e., conditions that were difficult to predict in Fig. 2 still had low predictive performance despite the use of more complex models.
Lastly, we studied whether different feature engineering affects the prediction results for different PASC conditions. Instead of using pre-defined baseline comorbidities, we leveraged a data-driven approach that used the first three digits of the ICD-10 codes of all diagnoses and all medications in RxNorm codes at the ingredient level in the baseline period to predict PASC. We report the predictive performance of different machine learning models using this large set of features in Extended Data Fig. 3, which does not show large differences compared to the performance in Extended Data Fig. 2 or Fig. 2, and the heterogeneous predictability patterns remain the same.

Validation by the OneFlorida+ Cohort
To assess the generalizability of our findings, we replicated our analyses on the OneFlorida+ cohort as an independent validation. The OneFlorida+ cohort included 22,341 adult patients with lab-confirmed SARS-CoV-2 infection and 177,010 non-infected control patients (see the inclusion cascade in Fig. 1). We summarized the prediction performance for different potential PASC conditions with the regularized Cox model in Extended Data Fig. 4 and the identified risk associations in Extended Data Fig. 5. From Extended Data Fig. 4 we again observed the heterogeneous predictability of different conditions seen in Fig. 2, and the more predictable conditions (with C-index > 0.8, such as malnutrition, COPD, dementia, and acute kidney failure) and less predictable conditions (with C-index around or below 0.6, such as fatigue, anxiety, sleep disorders, and depression) remained the same. Similarly, the risk associations shown in Extended Data Fig. 5 are consistent with the risk associations shown in Fig. 3. Hospitalization and ICU admission in the acute infection phase were associated with a diverse set of incident diagnoses in the post-acute infection phase. We still observed the risk associations between older age and dementia (5.4-fold increased risk), between female gender and hair loss (2.2-fold increased risk), and between Black race and diabetes (1.5-fold increased risk). Infection confirmation from July to November 2021 was associated with a 1.7-fold increased risk of being diagnosed with general PASC symptoms and signs (the U099/B948 ICD code).

Discussion
In this paper, we conducted a systematic study on the predictability of a wide range of potential PASC conditions as well as their associated risk factors using the EHR data from two large-scale PCORnet clinical research networks, INSIGHT and OneFlorida+. In contrast to existing research on this topic, which was mostly based on patient-reported symptoms 12,25 , our study was based on routinely collected EHR data from large patient populations.
We investigated the predictability of different potential PASC diagnoses using patient demographics, prior conditions, and care settings in the acute phase. Different types of machine learning models, including linear models, tree-based models, and deep learning models, were tested. Following prior research on [...] The results from regularized Cox regression are summarized in Fig. 2, which suggests that different conditions were associated with different predictabilities in the INSIGHT cohort. Conditions such as stroke, heart failure, and kidney failure were more predictable. These conditions have clear diagnostic criteria according to the underlying disease etiologies and are more likely to be severe COVID complications. Pressure ulcers were also highly predictable, but this was more likely due to prolonged hospital stay during the acute phase of SARS-CoV-2 infection 27 . General PASC symptoms and signs with the U099/B948 codes were also associated with good prediction performance, which is consistent with prior studies 28 . One potential reason is that these codes were relatively new, and clinicians might have been cautious, using them only when the symptoms and signs were typical. Conditions such as headache, dizziness, chest pain, joint pain, anxiety, and depressive disorders were more difficult to predict. These conditions are more subjective to diagnose, more similar to patient-reported symptoms, and cannot be explained by alternative disease etiologies. More complex machine learning models did not change this pattern, as evidenced by Extended Data Fig. 2. In addition, we replicated the predictive modeling analysis on the OneFlorida+ cohort, and the results summarized in Extended Data Fig. 4 were highly consistent with the conclusions obtained from the INSIGHT cohort.
Using fully adjusted analyses, we examined the statistical associations between each potential PASC condition and a broad list of covariates, including demographics, pre-existing conditions, and severity in the acute phase of SARS-CoV-2 infection as indicated by care setting. For a covariate to be considered a potential risk factor for a specific condition, we required its adjusted hazard ratio (aHR) to be larger than 1 and statistically significant. We further required the estimated aHR to be larger in patients who were infected by SARS-CoV-2 than in non-infected patients, so that associations that may not be attributable to COVID-19 were filtered out. Figure 3 and Extended Data Fig. 5 summarize the identified risk associations from the INSIGHT and OneFlorida+ cohorts. Both figures show that hospitalization and ICU admission during the acute infection phase were associated with a broad set of incident conditions in the post-acute infection phase, including pressure ulcers, heart failure, acute kidney failure, and COPD, which suggests that these conditions could be related to either severe acute COVID complications or acute care processes. Older age (≥75 years) was also found to be a potential risk factor for many of these conditions. Black patients were at higher risk of being diagnosed with incident diabetes. These findings are consistent with the conclusions of prior studies 29,30.
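The screening rule described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the study's implementation: the `Association` record, the use of a p-value with a 0.05 threshold as the significance test, and all numbers and names in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Association:
    covariate: str
    condition: str
    ahr_infected: float      # aHR estimated in SARS-CoV-2 infected patients
    p_value_infected: float  # significance of that aHR (assumed test)
    ahr_noninfected: float   # aHR estimated in non-infected patients


def is_potential_risk_factor(a: Association, alpha: float = 0.05) -> bool:
    """Apply the three criteria: aHR > 1, significant, above background."""
    elevated = a.ahr_infected > 1.0
    significant = a.p_value_infected < alpha  # alpha = 0.05 is an assumption
    exceeds_background = a.ahr_infected > a.ahr_noninfected
    return elevated and significant and exceeds_background


def screen(associations: List[Association], alpha: float = 0.05) -> List[Association]:
    """Keep only associations that satisfy all three criteria."""
    return [a for a in associations if is_potential_risk_factor(a, alpha)]


# Made-up numbers for illustration only; they are not study results.
examples = [
    Association("covariate_A", "condition_X", 1.8, 0.001, 1.2),  # kept
    Association("covariate_B", "condition_X", 1.5, 0.20, 1.1),   # not significant
    Association("covariate_C", "condition_Y", 1.3, 0.01, 1.6),   # background aHR is larger
    Association("covariate_D", "condition_Y", 0.8, 0.01, 0.7),   # aHR not above 1
]

kept = [a.covariate for a in screen(examples)]
print(kept)  # ['covariate_A']
```

The third criterion is what removes background associations: a covariate that raises the hazard equally or more in non-infected patients is not counted as a COVID-related risk factor.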
Other notable risk associations consistently identified in both cohorts include those between higher baseline comorbidity burden and fluid disorders, between baseline obesity and sleep disorders, and between baseline end-stage renal disease and malnutrition. Some associations should be interpreted more cautiously. For example, baseline pulmonary circulation disorder was consistently identified as a risk factor for pulmonary embolism, but the two conditions are highly correlated with each other, and this association could simply reflect ICD coding and grouping. Another example is the association between baseline pregnancy and anemia, as anemia is the most common hematologic problem in pregnancy 31. However, there were also studies suggesting that SARS-CoV-2 infection during pregnancy can further exacerbate iron deficiency anemia due to hyperinflammation during the acute infection phase 32.
Our study has several strengths. First, we studied a comprehensive set of associations between 89 factors and 44 incident PASC conditions in two large EHR cohorts. To our knowledge, this is one of the largest studies on predictive modeling and risk factor analysis for PASC. Second, we derived our primary results from INSIGHT and performed a validation study on OneFlorida+, which supported the generalizability of our findings. Third, we tested the prediction performance of different machine learning models on both a narrow and a broad list of covariates, which further supports the robustness of our conclusions.
Finally, we ruled out potential background associations by requiring the adjusted hazard ratio of each identified association, as estimated in patients who were infected by SARS-CoV-2, to be larger than the value estimated in patients who were not infected by SARS-CoV-2.
Our study had several limitations. Our analysis was based on EHR data, which would miss information from patients who did not visit hospitals within the clinical research networks (CRNs). We considered only newly incident individual conditions in the post-acute period and did not explore conditions that were prolonged, worsened, or relapsed across the periods before and after COVID-19 infection, nor did we explore condition clusters or subphenotypes. Vaccine information was not incorporated in our study because it was incomplete in the EHR records, and we are working on adding other data sources (e.g., state registry data) to make this information more robust. In addition, our analyses did not cover the recent Omicron wave because the corresponding data were not yet available.
In conclusion, we used two large-scale clinical research networks, INSIGHT and OneFlorida+, to identify risk factors associated with newly incident PASC conditions and to develop predictive models that identify those who are at risk of these conditions. Our results highlight that several of the more predictable PASC diagnoses are associated with severity in the acute phase. However, less predictable PASC diagnoses represent an ongoing challenge that may not respond to other measures that decrease the severity of acute COVID.